Arabic Plagiarism Detection Using Word Correlation in N-Grams with K-Overlapping Approach, Working Notes for PAN-AraPlagDet at FIRE 2015
Abstract
This report describes the Arabic plagiarism detection system behind the run we submitted to the AraPlagDet competition at FIRE 2015. The system has four main stages. The first is pre-processing, which comprises tokenisation and stop-word removal. The second retrieves a list of candidate documents for each suspicious document using K-gram fingerprinting and the Jaccard coefficient. In the third stage, each suspicious document is compared in depth with its candidate documents. Both texts are split into N-grams with K-overlapping, where N and K were set experimentally to 8 and 3, respectively, and the similarity between each pair of N-grams is computed from word correlations: each word is compared with the words in the candidate N-gram and given a correlation of 1 if a match is found. The correlation values are averaged and the average is compared to a threshold. The last stage is post-processing, in which consecutive N-grams are joined to form unified plagiarised segments. Performance on the training corpus was encouraging (recall = 0.829, precision = 0.843, granularity = 1.11). Recall on the test collection was lower (recall = 0.530), but precision and granularity remained consistent with the training set (precision = 0.831, granularity = 1.18). The drop in recall may be because our candidate retrieval stage retrieves only documents that share copied fragments, whereas some plagiarised documents contain no exact copies. Although the system can detect some forms of obfuscation, such as restructuring or rewording of a few phrases, it might not work with manual paraphrases. In future work we plan to improve the candidate retrieval stage and to incorporate semantic-based metrics in the detection stage.
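The stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions, since the abstract does not specify them: "K-overlapping" is read as consecutive N-grams sharing K words, fingerprints are built from word K-grams, the fingerprint size (`k=3` in `kgram_fingerprint`) and the threshold value (`0.6`) are hypothetical, and all function names are invented for illustration.

```python
def kgram_fingerprint(tokens, k=3):
    """Set of word k-grams of a document (assumed fingerprint form)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_ngrams(tokens, n=8, k=3):
    """Return (start, ngram) pairs; consecutive n-grams share k tokens."""
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    grams = []
    for start in range(0, len(tokens), n - k):
        grams.append((start, tokens[start:start + n]))
        if start + n >= len(tokens):  # the last n-gram reached the end
            break
    return grams


def word_correlation(sus_gram, cand_gram):
    """Average of per-word correlations: 1 if the word occurs in cand_gram."""
    if not sus_gram:
        return 0.0
    cand_words = set(cand_gram)
    return sum(1 for w in sus_gram if w in cand_words) / len(sus_gram)


def detect(sus_tokens, cand_tokens, n=8, k=3, threshold=0.6):
    """Return merged (start, end) token spans of the suspicious document.

    The threshold value is hypothetical; the abstract does not report it.
    """
    cand_grams = overlapping_ngrams(cand_tokens, n, k)
    spans = []
    for start, gram in overlapping_ngrams(sus_tokens, n, k):
        if any(word_correlation(gram, c) >= threshold for _, c in cand_grams):
            end = start + len(gram)
            if spans and start <= spans[-1][1]:
                # Post-processing: join consecutive flagged n-grams.
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


# Toy example: 13 source words, one of them reworded, embedded in other text.
source = [f"w{i}" for i in range(20)]
copied = source[:13]
copied[2] = "zz"  # a single reworded token
suspicious = [f"x{i}" for i in range(10)] + copied + [f"y{i}" for i in range(10)]
print(detect(suspicious, source))
```

In this toy case the N-gram containing the reworded token still passes the threshold, which is consistent with the abstract's remark that rewording a few phrases can be tolerated, while fully paraphrased text would share too few words to be flagged.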
Similar resources
Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection
AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two subtasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been submitted and tested on the standardized corpora developed for the track. This overview paper describes these evaluation corpora, discusses the participants’ m...
RDI System for Extrinsic Plagiarism Detection (RDI_RED), Working Notes for PAN-AraPlagDet at FIRE 2015
Extrinsic plagiarism detection has recently attracted the attention of many researchers. Plagiarism has become increasingly difficult to detect because of sophisticated techniques beyond direct copy and paste, such as phrase rephrasing, word shuffling, and semantic substitution. In this paper, we present RDI system for extrinsic plagiarism detection (...
RDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PAN-AraPlagDet at FIRE 2015
Many researchers have been investigating the task of plagiarism detection lately. In this paper we present the RDI system for intrinsic plagiarism detection (RDI_RID). RDI_RID was the only system that participated in the intrinsic track of the Arabic language plagiarism detection competition. It achieved a PlagDet (Plagiarism Detection score) of 19% compared to 38% achieved by the ba...
University of Sheffield - Lab Report for PAN at CLEF 2010
This paper describes the University of Sheffield entry for the 2nd international plagiarism detection competition (PAN 2010). Our system attempts to identify extrinsic plagiarism. A three-stage approach is used: pre-processing, candidate document selection (using word n-grams) and detailed analysis (using the Running Karp-Rabin Greedy String Tiling string matching algorithm). This approach achi...
Using a Variety of n-Grams for the Detection of Different Kinds of Plagiarism: Notebook for PAN at CLEF 2013
A text can be plagiarised in different ways. The text may be copied and pasted word by word, parts of the text may be changed, or the whole text may be summarised into one or two lines. Different kinds of plagiarism require different strategies to detect them. But rarely do we know beforehand what type of plagiarism we are dealing with. In this paper we present a system that can detect verbatim...
Publication date: 2015